Sam’s Website Presentation

12/09/2024

Sam English

Project 1 - Part 1 - Goals Scored at Each World Cup

This analysis uses a TidyTuesday dataset titled “World Cup”, which fascinated me as a soccer enthusiast. The data cover results, locations, matches played, and goals scored from 1930 to 2018, the most recent World Cup when the dataset was created.

I chose to visualize the number of goals scored at each tournament dating back to 1930.

Visualization

# A tibble: 21 × 2
# Groups:   year [21]
    year goals_scored
   <dbl>        <dbl>
 1  1930           70
 2  1934           70
 3  1938           84
 4  1950           88
 5  1954          140
 6  1958          126
 7  1962           89
 8  1966           89
 9  1970           95
10  1974           97
# ℹ 11 more rows

Project 1 - Part 2 - Chocolate Ratings by Percent Cocoa

In this analysis, I chose another dataset that intrigued me, titled “Chocolate Ratings”. I visualized the data to find which percentage of cocoa in a chocolate bar earns the highest average rating.

Visualization

Here is the pipeline that produced the data I used. I had to convert the percentages from text to numeric values so that they would sort in numeric order.

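chocolate |>
  # Strip the "%" sign so the percentages are treated as numbers, not text
  mutate(cocoa_percent = as.numeric(sub("%", "", cocoa_percent))) |>
  select(cocoa_percent, rating) |>
  group_by(cocoa_percent) |>
  # summarise() returns one row per group, already ordered by cocoa_percent
  summarise(ave_rating = mean(rating))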
# A tibble: 46 × 2
   cocoa_percent ave_rating
           <dbl>      <dbl>
 1          42         2.75
 2          46         2.75
 3          50         3.75
 4          53         2   
 5          55         2.86
 6          56         3.25
 7          57         2.75
 8          58         3.12
 9          60         3.01
10          60.5       2.75
# ℹ 36 more rows
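To illustrate why the conversion to numeric values matters, here is a minimal base R sketch. The four percentages are made-up examples, not rows from the chocolate dataset. Sorting the raw text compares character by character, so "100%" lands ahead of "42%".

```r
# Made-up cocoa percentages stored as text (hypothetical values for illustration)
cocoa_text <- c("70%", "100%", "42%", "60.5%")

# Sorting the raw strings compares them character by character
text_order <- sort(cocoa_text)

# Removing the percent sign and converting gives a true numeric order
cocoa_num <- as.numeric(sub("%", "", cocoa_text, fixed = TRUE))
numeric_order <- sort(cocoa_num)

print(text_order)
print(numeric_order)
```

Once the column is numeric, grouping and plotting follow the real order of the percentages.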

Project 2 - Netflix Title Analysis

In this project, I focused on data about Netflix titles and made three visualizations. The purpose was to use piping and regular expressions to organize the data so that it is easier to understand and present.

Visualization 1

First, I wanted to find the most common words in the titles of Netflix movies and TV shows, excluding filler words like “the” and “and”. This is the organized dataset that I used for my visualization.

# A tibble: 10 × 2
# Groups:   words [10]
   words         n
   <chr>     <int>
 1 love        152
 2 my          127
 3 you          81
 4 man          79
 5 christmas    78
 6 world        69
 7 story        67
 8 life         66
 9 movie        60
10 little       58
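A word count like the one above can be sketched in base R as follows. The titles and the stop-word list are hypothetical, and this is only one possible approach, not necessarily the pipeline used for the table.

```r
# Hypothetical titles and filler words, chosen only for illustration
titles <- c("Love Is Blind", "The Love Story", "My Little World", "Love and the World")
stop_words <- c("the", "and", "is", "a", "of")

# Lower-case each title, then split on any run of characters
# that is not a letter or an apostrophe
words <- unlist(strsplit(tolower(titles), "[^a-z']+"))

# Drop empty strings left by the split, and drop the filler words
words <- words[words != "" & !(words %in% stop_words)]

# Count each remaining word and keep the ten most frequent
word_counts <- sort(table(words), decreasing = TRUE)
print(head(word_counts, 10))
```

The regular expression does the tokenizing here. Which words count as filler is a judgment call that changes the ranking.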

Visualization 2

Second, I compared the number of movies and TV shows on Netflix by release year.

# A tibble: 118 × 3
# Groups:   release_year [73]
   release_year type    count
          <dbl> <chr>   <int>
 1         1925 TV Show     1
 2         1942 Movie       2
 3         1943 Movie       3
 4         1944 Movie       3
 5         1945 Movie       3
 6         1946 Movie       1
 7         1946 TV Show     1
 8         1947 Movie       1
 9         1954 Movie       2
10         1955 Movie       3
# ℹ 108 more rows
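A table of counts by release year and type can be built along these lines. The rows below are invented, and base R's aggregate() stands in for whatever grouping functions were actually used.

```r
# Invented rows with the same two columns as the table above
netflix <- data.frame(
  release_year = c(2019, 2019, 2019, 2020, 2020),
  type = c("Movie", "Movie", "TV Show", "Movie", "TV Show")
)

# Give every title a count of 1, then sum within each year/type pair
netflix$count <- 1L
counts <- aggregate(count ~ release_year + type, data = netflix, FUN = sum)

# Order the result by year and then by type, like the printed table
counts <- counts[order(counts$release_year, counts$type), ]
print(counts)
```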

Visualization 3

Finally, I calculated the percentage of titles that contain a digit, that start with “the”, or that contain the word “the” anywhere, and showed the results on a bar graph. This turned out to be far more difficult than I expected. I found it hard to give the axes labels that differed from the variable names, which were sometimes confusing.
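The three percentages can be sketched with grepl() as follows. The titles are hypothetical and the variable names are mine, so treat this as an illustration rather than the code behind the bar graph. Storing a readable label next to each percentage is one way to keep plot labels separate from variable names.

```r
# Hypothetical titles for illustration
titles <- c("The Crown", "13 Reasons Why", "Breaking Bad",
            "Orange Is the New Black", "Apollo 13")

# One logical vector per pattern; \\b marks a word boundary so that
# "the" does not match inside longer words such as "Them"
has_digit    <- grepl("[0-9]", titles)
starts_the   <- grepl("^the\\b", titles, ignore.case = TRUE, perl = TRUE)
contains_the <- grepl("\\bthe\\b", titles, ignore.case = TRUE, perl = TRUE)

# The mean of a logical vector is the share of TRUE values
pattern_summary <- data.frame(
  pattern = c("Contains a digit", 'Starts with "the"', 'Contains "the"'),
  percent = 100 * c(mean(has_digit), mean(starts_the), mean(contains_the))
)
print(pattern_summary)
```

The pattern column holds display text, so a plot built from this table shows those labels instead of names like has_digit.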

Project 3 - Generational Marijuana Use

This project involved running a permutation test to check whether a relationship exists between two variables, starting from a null hypothesis of no relationship. In my analysis, I tested the relationship between parental and child use of marijuana. This was my favorite project. I really enjoyed running the statistical analysis and presenting the results.

Visualization

I first found the difference in the proportion of students who use marijuana between those whose parents used it and those whose parents did not. Students whose parents used marijuana were 19.5 percentage points more likely to use it themselves.

# A tibble: 445 × 2
   student parents
   <fct>   <fct>  
 1 uses    used   
 2 uses    used   
 3 uses    used   
 4 uses    used   
 5 uses    used   
 6 uses    used   
 7 uses    used   
 8 uses    used   
 9 uses    used   
10 uses    used   
# ℹ 435 more rows
[1] 0.1952381
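The difference in proportions can be computed along these lines. The four cell counts are an assumption: the source does not show them, and they are chosen only because they are consistent with the 445 rows and the 0.1952381 printed above.

```r
# Assumed cell counts (not shown in the source), totalling 445 rows
cell_counts <- c(125, 85, 94, 141)
survey <- data.frame(
  student = factor(rep(c("uses", "not", "uses", "not"), times = cell_counts)),
  parents = factor(rep(c("used", "used", "not", "not"), times = cell_counts))
)

# Share of students who use within each parental group (each row sums to 1)
props <- prop.table(table(survey$parents, survey$student), margin = 1)

# Difference between the "parents used" and "parents did not" groups
observed_diff <- props["used", "uses"] - props["not", "uses"]
print(observed_diff)
```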

Then I ran a permutation test with 1,000 shuffles to estimate how likely a difference this large would be under the null hypothesis of no relationship. On this graph you can see the results of the analysis, where the red line represents the difference in proportions in the actual data.

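Using this data I found a p-value of 0: none of the 1,000 permuted differences was as large as the observed one. With 1,000 permutations, this is better read as a p-value below 0.001 than as exactly zero.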
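A permutation test of this kind can be sketched in base R as follows. The cell counts are again an assumption, picked to be consistent with the 445 rows and the observed difference of 0.1952381. The seed, the helper name diff_in_props, and the add-one adjustment are my own choices rather than part of the original analysis.

```r
set.seed(47)  # arbitrary seed so the sketch is reproducible

# Assumed cell counts (not shown in the source), totalling 445 students
cell_counts  <- c(125, 85, 94, 141)
student_uses <- rep(c(TRUE, FALSE, TRUE, FALSE), times = cell_counts)
parents_used <- rep(c(TRUE, TRUE, FALSE, FALSE), times = cell_counts)

# Hypothetical helper: share of users among students whose parents used,
# minus the share among students whose parents did not
diff_in_props <- function(uses, used) mean(uses[used]) - mean(uses[!used])

observed_diff <- diff_in_props(student_uses, parents_used)

# Shuffling the parent labels breaks any link between the two variables,
# which is what the null hypothesis of no relationship assumes
n_perm <- 1000
null_diffs <- replicate(n_perm, diff_in_props(student_uses, sample(parents_used)))

# One-sided p-value: share of shuffled differences at least as large as observed
p_value <- mean(null_diffs >= observed_diff)

# A common adjustment counts the observed data as one more permutation,
# so the reported value can never be exactly zero
p_adjusted <- (sum(null_diffs >= observed_diff) + 1) / (n_perm + 1)

print(c(observed = observed_diff, p_value = p_value, p_adjusted = p_adjusted))
```

With the adjustment, the smallest value that 1,000 permutations can report is 1/1001, which makes the resolution of the test explicit.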

Project 4 - SQL Analysis of WAI Data for Auditory Research

In this project, I aimed to recreate a graph that shows the relationship between mean sound absorbance and frequency. I then made my own graph, comparing mean absorbance across frequencies for subjects recorded as male, female, or of unknown sex in the study conducted by Abur in 2014.

Visualization 1

Here is my attempt at recreating the graph shown at https://pmc.ncbi.nlm.nih.gov/articles/PMC7093226/#F1.

SELECT 
  Measurements.Identifier,
  COUNT(DISTINCT CONCAT(Measurements.SubjectNumber, Measurements.Ear)) AS Unique_Ears,
  PI_Info.AuthorsShortList,
  Measurements.Instrument,
  Measurements.Frequency,
  AVG(Measurements.Absorbance) AS MeanAbsorbance,
  CONCAT(PI_Info.AuthorsShortList, ' et al. N=', 
         COUNT(DISTINCT CONCAT(Measurements.SubjectNumber, Measurements.Ear)), ', ', Measurements.Instrument) AS LegendLabel
FROM Measurements
JOIN PI_Info ON Measurements.Identifier = PI_Info.Identifier
WHERE Measurements.Identifier IN ('Abur_2014', 'Feeney_2017', 'Groon_2015', 'Lewis_2015', 'Liu_2008', 'Rosowski_2012', 'Shahnaz_2006', 'Shaver_2013', 'Sun_2016', 'Voss_1994', 'Voss_2010', 'Werner_2010')
  AND Measurements.Frequency >= 200
GROUP BY Measurements.Identifier, Measurements.Instrument, PI_Info.AuthorsShortList, Measurements.Frequency;

Visualization 2

In my second visualization, I compare mean absorbance within the study conducted by Abur in 2014. I differentiate between male subjects, female subjects, and subjects whose sex was unknown.

Thank You